On the genre-fication of Music: a percolation approach (long version)

In this paper, we analyze web-downloaded data on people sharing their music library. By attributing usual music genres (Rock, Pop...) to each music group, and by analysing correlations between music groups of different genres with percolation-idea based methods, we probe the reality of these subdivisions and construct a music genre cartography, with a tree representation. We also show the diversity of music genres with Shannon entropy arguments, and discuss an alternative objective way to classify music, based on the complex structure of the groups' audience. Finally, a link is drawn with the theory of hidden variables in complex networks.

I. INTRODUCTION

Take a sample of people, and make them listen to a list of songs. While a majority of them should agree on basic subdivisions, such as Rock/Jazz/Pop..., a more refined description will lead to more and more disparate answers, even contradictions. These originate from the different background, taste, music knowledge, mood or network of acquaintances of the listeners; in a statistical physics description, these processes correspond to ageing, internal fluctuations and neighbour-neighbour interactions. The increasingly eclectic music offer, together with the constant mixing of old music genres into new ones, makes the problem still more complicated. Even artists seem to avoid the usual classifications by refusing to fit into well-defined categories, and prefer to characterise themselves as a unique mix of their old influences.

Obviously, categorising music, especially into finer genres or subgenres, is not an easy task, and is strongly subjective. This task is also complicated by the constant birth of new styles, and by the very large number of existing sub-divisions. For instance, the genre Electronic music is divided in Wikipedia into 9 sub-genres (Ambient, Breakbeat...), each of them being divided into several sub-subgenres. This categorisation becomes more and more complex in the course of time.

This paper addresses the above problems by showing in an "objective" way the existence of music trends that allow one to classify music groups, as well as the relations between the usual genres and sub-genres. To do so, we use web-downloaded data, and define classifications based on the listening habits of the groups' audience. Thereby, we account for the fact that music perception is driven not only by the people who make music (artists, Majors...), but also by the people who listen to it. Our analysis consists in characterising a large sample of individual musical habits from a statistical physics point of view, and in extracting collective trends. In a previous work [4], we have shown that such collective listening habits may lead to the usual music subdivisions in some particular cases, but also to unexpected structures that do not fit the neat usual genres defined by the music industry. Those represent the non-conventional taste of listeners. Let us note that alternative music classifications based on signal analysis may also be considered [5, 6].

In section II, we describe the methodology, namely the analysis of empirical data from collaborative filtering websites, e.g. audioscrobbler.com and musicmobs.com. We will also give a short review of the statistical methods introduced in [4]. Mainly, these methods consist in evaluating the correlations between the groups, depending on their audience, and in using filtering methods, i.e. percolation idea-based (PIB) methods, in order to visualise the collective behaviours. In section III, we attribute lists of genres to a sample of music groups, by downloading data from the web. These data, which describe the different tags, i.e. genres, used by people to classify music groups, are analysed by using the Shannon entropy as a measure of the music group diversity. By examining correlations between these different music genres, we also use the statistical methods of section II in order to make a map of music genres (see [7] for an example from the social sciences). This cartography is justified by the fact that alike music genres are statistically correlated by their audience. It is shown that these correlations are homophilic [8], i.e. alike music genres tend to be listened to by the same audience. Homophily is known to occur in many social systems, including online communities [9], co-authorship networks [10, 11] and linking patterns between political bloggers [12].

Let us stress that the issues of this work are part of the intense ongoing research activity of physicists on opinion formation [13, 14, 15, 16, 17], self-organisation on networks [18, 19], including clique formation [20], percolation transitions [21], as well as on the identification of a priori unknown collective behaviours in complex networks [22], e.g. proteins [23], genes [24], linguistics [25, 26], industrial sectors [27], groups of people [28]...

FIG. 1: Branching representation of a square correlation matrix of 13 elements. At each increasing step (t = 0, 1, 2) of the filter $\phi$, links are removed, so that the network decomposes into isolated islands. These islands are represented by squares, whose size depends on the number of nodes in the island. Islands composed of only one music group are not depicted. Starting from the largest island, branches indicate a parent relation between the islands. The increasing filter method is applied until all links are removed.



II. METHODOLOGY

A. Data analysis

FIG. 2: Empirical probability histogram of the top 10 genres tagged by listeners to ABBA (a), and to John Coltrane (b). The data have been downloaded from http://www.lastfm.com in August 2005.



In this work, we analyze data retrieved from collaborative filtering websites (see [29] for a detailed definition). These sites invite people to share their profiles and experiences in order to help them discover new music, books, etc. that should (statistically) correspond to their own taste. In the present case, we focus on a database downloaded from audioscrobbler.com in January 2005. It consists of a listing of users (each represented by a number), together with the list of music groups the users own in their library. This structure directly leads to a bipartite network for the whole system, namely a network composed of two kinds of nodes: the persons, called users or listeners in the following, and the music groups. The network can be represented by a graph with an edge running between a group $i$ and a user $\mu$ if $\mu$ owns $i$.
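As an illustration of the bipartite structure described above, the following minimal sketch builds the two adjacency maps (groups owned by each user, users owning each group) from a listing. The function name `build_bipartite`, the input format and the toy listing are assumptions made for the example; they are not the format of the audioscrobbler.com database.

```python
from collections import defaultdict


def build_bipartite(listing):
    """Build both sides of the user-group bipartite network.

    `listing` maps a user id to the list of music groups found in
    that user's library (hypothetical input format).
    """
    groups_of_user = {}
    users_of_group = defaultdict(set)
    for user, groups in listing.items():
        owned = set(groups)  # a group listed twice counts once
        groups_of_user[user] = owned
        for group in owned:
            users_of_group[group].add(user)
    return groups_of_user, dict(users_of_group)


# Toy listing (invented data, for illustration only).
listing = {
    1: ["Radiohead", "ABBA"],
    2: ["Radiohead", "John Coltrane"],
    3: ["Radiohead"],
}
groups_of_user, users_of_group = build_bipartite(listing)

# Average degree on each side of the bipartite network.
mean_groups_per_user = sum(len(g) for g in groups_of_user.values()) / len(groups_of_user)
mean_users_per_group = sum(len(u) for u in users_of_group.values()) / len(users_of_group)
```

The two averages computed at the end are the toy analogues of the mean library size per user and the mean audience per group.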

In the original data set, there are 617900 different music groups and 35916 users, although the number of groups is inflated by the multiple (even erroneous) ways for a user to characterise an artist (e.g. The Beatles, Beatles and The Beetles count as three music groups). On average, each user owns 140 music groups in his/her library, while each group is owned by 8 persons. For completeness, let us note that the listener with the most groups possesses 4072 groups (0.6% of the total music library) while the group with the largest audience, Radiohead, has 10194 users (28% of the user community). This asymmetry in the bipartite network is expected, as users have in general specific tastes that prevent them from listening to any kind of music, while there exist mainstream groups that are listened to by a very large audience. Let us stress that this asymmetry is also observable in the degree distributions for the people and for the groups.

In the following, we restrict the total number of groups for computational reasons, namely we analyse a subset composed of the top 1000 most-owned groups. This limited choice was also motivated by the possibility of identifying these groups at first sight.
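The selection of the most-owned groups can be sketched as follows. The helper `top_groups`, the alphabetical tie-breaking rule and the toy data are assumptions of this example, since the text does not specify how ties were handled.

```python
def top_groups(users_of_group, n):
    """Return the n groups with the largest audience, largest first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    ranked = sorted(users_of_group, key=lambda g: (-len(users_of_group[g]), g))
    return ranked[:n]


# Toy audiences (invented data): group name -> set of user ids.
users_of_group = {
    "Radiohead": {1, 2, 3},
    "ABBA": {1},
    "John Coltrane": {2, 3},
}
top_two = top_groups(users_of_group, 2)
```

In the setting of the paper, `n` would be 1000.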

B. Percolation idea-based filtering 

In this section, we review the method introduced in [4] in order to extract collective structures from the data set. Each music group $i$ is characterised by its signature, i.e. the vector:

$$\bar{\Gamma}^i = (..., 1, ..., 0, ..., 1, ...) \qquad (1)$$

of $n_L$ components, where $n_L = 35916$ is the total number of users in the system, and where $\Gamma^i_\mu = 1$ if the listener $\mu$ owns group $i$ and $\Gamma^i_\mu = 0$ otherwise. By doing so, we consider that the audience of a music group, i.e. the list of persons listening to it, identifies its signature.

In order to quantify the correlations between two music groups $i$ and $j$, we calculate the symmetric correlation measure:

$$C^{ij} = \frac{\bar{\Gamma}^i \cdot \bar{\Gamma}^j}{|\bar{\Gamma}^i| \, |\bar{\Gamma}^j|} \equiv \cos\theta_{ij} \qquad (2)$$

where $\bar{\Gamma}^i \cdot \bar{\Gamma}^j$ denotes the scalar product between the two $n_L$-vectors, and $|\cdot|$ its associated norm. This correlation measure, which corresponds to the cosine of the angle between the two vectors in the $n_L$-dimensional space, vanishes when the groups are owned by disjoint audiences, and is equal to 1 when their audiences strictly overlap.
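Since the signatures are binary vectors, the scalar product in Eq. (2) reduces to the number of common listeners and each norm to the square root of the audience size. A minimal sketch under that observation, where the function name `cosine_correlation` and the representation of audiences as sets of user ids are assumptions of the example:

```python
import math


def cosine_correlation(audience_i, audience_j):
    """Cosine between two binary signatures, given as sets of user ids.

    For 0/1 vectors the scalar product is the size of the intersection
    and each norm is the square root of the audience size.
    """
    if not audience_i or not audience_j:
        return 0.0  # convention chosen here for empty audiences
    shared = len(audience_i & audience_j)
    return shared / math.sqrt(len(audience_i) * len(audience_j))


disjoint = cosine_correlation({1, 2}, {3, 4})      # no common listener
identical = cosine_correlation({1, 2, 3}, {1, 2, 3})  # same audience
partial = cosine_correlation({1, 2}, {2, 3})       # one listener in common
```

The two limiting cases reproduce the behaviour stated in the text: the measure vanishes for disjoint audiences and equals 1 for identical ones.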

In order to extract families of alike music groups from the correlation matrix $C^{ij}$, we use the PIB method [4]. We define the filter coefficient $\phi \in [0, 1[$, and filter the matrix elements so that $C^{ij}_f = 1$ if $C^{ij} > \phi$, and $C^{ij}_f = 0$ otherwise. Starting from $\phi = 0.0$, namely a fully connected network, increasing values of the filtering coefficient remove less correlated links and lead to the shaping of well-defined islands, completely disconnected from the main island. Let us stress that this systematic removal of links is directly related to percolation theory, and that the internal correlations in the network displace and broaden the percolation transition [?]. From a statistical physics point of view, the meaning of $\phi$ is that of the inverse of a temperature, i.e. high values of $\phi$ restrain the system to highly correlated islands; in the same way, low temperature restrains phase space exploration to low lying free energy wells. This observation suggests that PIB methods should be helpful in visualising free energy profiles and reaction coordinates between metastable states [30].
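The filtering step can be sketched as a threshold on the correlation matrix followed by a search for connected components, which play the role of the islands. This is an illustrative implementation, not the authors' code; the function name `filter_islands`, the dense list-of-lists matrix and the toy values are assumptions.

```python
def filter_islands(C, phi):
    """Islands (connected components) of the graph that keeps a link
    between nodes a and b only if C[a][b] > phi.

    C is a symmetric correlation matrix given as a list of lists.
    """
    n = len(C)
    seen = set()
    result = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        island = set()
        while stack:  # depth-first exploration of one island
            node = stack.pop()
            island.add(node)
            for other in range(n):
                if other != node and other not in seen and C[node][other] > phi:
                    seen.add(other)
                    stack.append(other)
        result.append(frozenset(island))
    return result


# Toy correlation matrix for 4 groups (invented values).
C = [
    [1.00, 0.85, 0.35, 0.15],
    [0.85, 1.00, 0.25, 0.15],
    [0.35, 0.25, 1.00, 0.65],
    [0.15, 0.15, 0.65, 1.00],
]
low = filter_islands(C, 0.0)   # every link survives: one island
mid = filter_islands(C, 0.4)   # two islands remain
high = filter_islands(C, 0.9)  # all links removed: isolated nodes
```

Raising `phi` on the toy matrix first leaves a single island, then splits it in two, and finally isolates every node.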

A branching representation of the community structuring is used to visualise the process (see Fig. 1 for a sketch of the first three steps of an arbitrary example). To do so, we start the procedure with the lowest value, $\phi = 0.0$, and we represent each isolated island by a square whose surface is proportional to its number of nodes (the music groups). Then, we slightly increase the value of $\phi$, e.g. by 0.01, and we repeat the procedure. From one step to the next, we draw a bond between emerging sub-islands and their parent island. The filter is increased until all bonds between nodes are eroded (that is, there is only one node left in each island). Let us note that islands composed of only one music group are not depicted, as these lonely music groups are self-excluded from the network structure, and hence from any genre. Applied to the above correlation matrix $C^{ij}$, the tree structure gives some insight into the diversification process by following branches from their source (top of the figure) toward their extremity (bottom of the figure). The longer a given branch persists, the more likely it is to form a well-defined music genre.
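The branching procedure can be sketched by sweeping the filter and linking each island to the island that contained it at the previous step. This is a hedged illustration rather than the original implementation: the names `islands` and `branching_tree`, the returned data layout and the toy matrix are assumptions, and a coarse step of 0.1 is used only to keep the example short.

```python
def islands(C, phi):
    """Connected components of the graph keeping links with C[a][b] > phi."""
    n, seen, result = len(C), set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, island = [start], set()
        while stack:
            node = stack.pop()
            island.add(node)
            for other in range(n):
                if other != node and other not in seen and C[node][other] > phi:
                    seen.add(other)
                    stack.append(other)
        result.append(frozenset(island))
    return result


def branching_tree(C, step=0.01):
    """Sweep phi = 0, step, 2*step, ... until no link is left.

    Returns (levels, links): levels[t] lists the islands with at least
    two nodes at step t (single nodes are not depicted), and links holds
    (t, parent, child) triples relating step t to step t - 1.
    """
    levels, links = [], []
    previous = None
    t = 0
    while True:
        current = [isl for isl in islands(C, t * step) if len(isl) > 1]
        if not current:
            break  # every bond has been eroded
        if previous is not None:
            for child in current:
                # Links are only removed, so each island is contained
                # in exactly one island of the previous step.
                parent = next(p for p in previous if child <= p)
                links.append((t, parent, child))
        levels.append(current)
        previous = current
        t += 1
    return levels, links


# Toy correlation matrix for 4 groups (invented values).
C = [
    [1.00, 0.85, 0.35, 0.15],
    [0.85, 1.00, 0.25, 0.15],
    [0.35, 0.25, 1.00, 0.65],
    [0.15, 0.15, 0.65, 1.00],
]
levels, links = branching_tree(C, step=0.1)
```

On the toy matrix, the single initial island splits into two branches at the fifth level, and the branch that survives longest corresponds to the most strongly correlated pair.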

In [4], we have shown that the resulting tree representation exhibits long persisting branches, some of them leading to standard, homogeneous style groupings, such as [Kylie Minogue, Dannii Minogue, Sophie Ellis Bextor] (dance pop), while many other islands are harder to explain from a standard genre-fication point of view and reveal evidence of unexpected collective listening habits.

FIG. 3: Empirical probability histogram of the relative entropy $R_i$ (see text for definition), obtained for the top 1000 music groups. The tagged genres have been downloaded from http://www.lastfm.com in August 2005.



III. GENRE CARTOGRAPHY 
A. Measure of diversity 

In view of the above analysis, attributing genres to music groups is a difficult problem. This complexity is made clearer by observing the different ways listeners characterise the same music group. To perform this analysis, we have downloaded from http://www.lastfm.com a list of the descriptions, i.e. genres, that people attach as tags to the music groups in their music library, together with the number of times each description occurred. For instance, from this site, one gets that ABBA (Fig. 2a) is described by an eclectic range of different music sub-divisions. These sub-divisions are based on the group style (Pop, Rock...), on the time period (80s, 70s...) or on geographical grounds (Swedish), and their choice depends on the listener, i.e. on his perception and subjective way of characterising music (see first paragraph of the introduction).

For this work, we have downloaded these lists of genres for the top 1000 groups, thereby empirically collecting a statistical genre-fication of the music groups. Let us stress that the data could not be downloaded for 5 of the groups, due to misprints in their name, e.g. Bjadrk instead of Björk. Consequently, we focus in the following on the $n_G = 995$ remaining music groups. One should also note that http://www.lastfm.com limits access to the top 25 genres of each group.

The statistical genre-fication of the sample may exhibit quantitatively different behaviours. For instance, a music group like John Coltrane (Fig. 2b) shows a peaked histogram, i.e. it is almost only described by the tag jazz, in contrast with ABBA, which is described by a large variety of tags. In order to measure the complexity, or diversity, of each music group $i$, we introduce the Shannon entropy [33]:

$$S^i = -\sum_g p_{i;g} \ln p_{i;g} \qquad (3)$$

where $p_{i;g}$ is the probability for genre $g$ to be tagged to the music group $i$, and the sum is performed over all possible genres (with, as said before, a maximum of 25). By construction, this quantity vanishes, $S^i_{min} = 0$, when the group $i$ is wholly described by one tag $g^*$, i.e. $p_{i;g} = \delta_{g g^*}$, while it takes its maximum value $S^i_{max} = \ln 25$ for the uniform distribution $p_{i;g} = 1/25$. In order to restrain the problem to the interval [0, 1], we introduce the relative quantity $R_i = S^i / S^i_{max}$. This quantity is therefore representative of the number of different terms needed by listeners to describe the music group $i$, i.e. the diversity of the music group. In Fig. 3, we plot the empirical distribution of this relative entropy over the 995 considered groups. It clearly shows a high degree of diversity of the music groups, which therefore require many different tags for their characterisation.

FIG. 4: Graph representation of the music genres filtered correlation matrix $M^{\gamma_1\gamma_2}$ for 3 values of the filter parameter, $\phi = 0.09$, $\phi = 0.12$ and $\phi = 0.15$, displayed from left to right. Rectangles represent the genres observed in the sample of 995 music groups. The action of filtering leads to a removal of less correlated links, thereby exhibiting the internal structure of the network. The graphs were plotted thanks to the visone graphical tools [31].
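The entropy of Eq. (3) and its relative version can be sketched as follows. The function names, the representation of a group's tags as a dictionary of counts and the toy counts are assumptions of the example; the normalisation by ln 25 follows the cap of 25 tags per group mentioned in the text.

```python
import math


def shannon_entropy(tag_counts):
    """Shannon entropy (natural logarithm) of a tag histogram.

    `tag_counts` maps a tag to the number of times it was used for
    one music group; tags with zero count do not contribute.
    """
    total = sum(tag_counts.values())
    probs = [count / total for count in tag_counts.values() if count > 0]
    return -sum(p * math.log(p) for p in probs)


def relative_entropy(tag_counts, max_tags=25):
    """Entropy divided by its maximum value ln(max_tags)."""
    return shannon_entropy(tag_counts) / math.log(max_tags)


# Toy histograms (invented counts).
peaked = {"jazz": 100}                             # one tag only
uniform = {"tag%d" % k: 4 for k in range(25)}      # 25 equally used tags
mixed = {"pop": 50, "70s": 25, "swedish": 25}

r_peaked = relative_entropy(peaked)
r_uniform = relative_entropy(uniform)
r_mixed = relative_entropy(mixed)
```

A group described by a single tag gives a relative entropy of 0, a group whose 25 tags are used equally often gives 1, and intermediate histograms fall in between.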



B. Genres correlations 

In this section, we use the methods of section IIB in order to analyse the correlations between genres attributed to each music group $i$. In the data set, we find 2394 different music genres. Nonetheless, in order to remove irrelevant tags (due to misprints for instance) and to simplify our analysis, we restrict the scope to all music genres that have been attributed to at least 20 music groups. There are 142 such music genres, which we label with the index $\gamma \in [1, 142]$. Let us denote by $G_\gamma$ this list of genres, and by $p_{i;\gamma}$ their probability for the music group $i$. For instance, these notations read as follows in the case of John Coltrane:

$$G = (..., \text{jazz}, ..., \text{saxophone}, ..., \text{free jazz}, ...)$$
$$p_{J.C.} = (..., 0.72, ..., 0.06, ..., 0.02, ...) \qquad (4)$$
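The restriction to frequently used genres can be sketched as below. The helper names, the dictionary-of-counts input and the toy data are assumptions of the example. In particular, normalising each probability by the group's full tag count (rather than renormalising over the retained genres only) is a choice made here, since the text does not specify it.

```python
from collections import Counter


def frequent_tags(tags_of_group, min_groups=20):
    """Tags attributed to at least `min_groups` different music groups."""
    usage = Counter()
    for counts in tags_of_group.values():
        # Each group contributes at most once to the usage of a tag.
        usage.update(tag for tag, count in counts.items() if count > 0)
    return sorted(tag for tag, n in usage.items() if n >= min_groups)


def tag_probabilities(counts, kept):
    """Probabilities of the retained tags for one group.

    Normalised by the group's total tag count (assumption of this sketch).
    """
    total = sum(counts.values())
    return {tag: counts[tag] / total for tag in kept if counts.get(tag, 0) > 0}


# Toy data (invented): group -> {tag: number of times used}.
tags_of_group = {
    "A": {"jazz": 8, "sax": 2},
    "B": {"jazz": 5, "pop": 5},
    "C": {"pop": 9, "jaz": 1},  # "jaz" mimics a misprinted tag
}
kept = frequent_tags(tags_of_group, min_groups=2)
p_A = tag_probabilities(tags_of_group["A"], kept)
```

With the threshold of the paper, `min_groups` would be 20; the misprinted tag of the toy data is discarded because only one group uses it.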



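In order to measure correlations between the 142 music genres, we define the 142 x 142 correlation matrix $M$, based on the correlations $C^{ij}$ between the music groups (see Eq. 2):

$$M^{\gamma_1\gamma_2} = \frac{1}{N^{\gamma_1\gamma_2}} \sum_i \sum_{j \neq i} p_{i;\gamma_1} \, p_{j;\gamma_2} \, C^{ij} \qquad (5)$$

where $N$ is a normalisation matrix:

$$N^{\gamma_1\gamma_2} = \sum_i \sum_{j \neq i} p_{i;\gamma_1} \, p_{j;\gamma_2} \qquad (6)$$

Practically, we make a loop over the $n_G(n_G - 1)$ pairs of different music groups $(i, j)$, each pair being characterised by the correlation coefficient $C^{ij}$. For each of these pairs, we evaluate all pairs of music genres $\gamma_1$ and $\gamma_2$ such that $p_{i;\gamma_1} \neq 0$ and $p_{j;\gamma_2} \neq 0$, and increase the matrix element $M^{\gamma_1\gamma_2}$ by the quantity $C^{ij} p_{i;\gamma_1} p_{j;\gamma_2}$. The normalisation matrix element $N^{\gamma_1\gamma_2}$ is itself updated by $p_{i;\gamma_1} p_{j;\gamma_2}$. At the end of the loops, the correlation matrix is normalised: $M^{\gamma_1\gamma_2} \to M^{\gamma_1\gamma_2} / N^{\gamma_1\gamma_2}$.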
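The loop described above can be sketched as follows, reading the genre correlation as an average of the group correlations $C^{ij}$ weighted by the tag probabilities, in line with Eqs. (5) and (6). The function name `genre_correlations`, the sparse dictionary layout and the toy values are assumptions of this example.

```python
def genre_correlations(C, p):
    """Genre-genre correlation as a weighted average of group correlations.

    C[i][j] is the correlation between groups i and j; p[i] maps each
    genre to its probability for group i (missing genres count as 0).
    Returns a dict keyed by (genre1, genre2).
    """
    M, N = {}, {}
    n = len(C)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # only pairs of different groups
            for g1, p1 in p[i].items():
                for g2, p2 in p[j].items():
                    weight = p1 * p2
                    if weight == 0:
                        continue
                    key = (g1, g2)
                    M[key] = M.get(key, 0.0) + C[i][j] * weight
                    N[key] = N.get(key, 0.0) + weight
    # Final normalisation of each matrix element.
    return {key: M[key] / N[key] for key in M}


# Toy example with 3 groups and 2 genres (invented values).
C = [
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.4],
    [0.1, 0.4, 1.0],
]
p = [
    {"jazz": 1.0},
    {"jazz": 0.5, "pop": 0.5},
    {"pop": 1.0},
]
M = genre_correlations(C, p)
```

On the toy data, the jazz-jazz element is driven by the strongly correlated pair of groups carrying that tag, whereas the jazz-pop element is lowered by the weakly correlated pair.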

In order to reveal collective behaviours from the correlation matrix $M^{\gamma_1\gamma_2}$, we apply PIB methods. Starting at a very low value of the filtering coefficient (see Fig. 4), say $\phi = 0.09$, the graph is fully connected. Increasing values of the filtering coefficient lead to the formation of cliques and to the emergence of disconnected islands, like those occurring in [4]. Finally, we plot in Fig. 5 the tree representation of the filtering process. Poring over the branches of this tree is very instructive and confirms the existence of non-trivial correlations between the different music genres. These correlations shape the relations between genres, and give an objective definition to the notions of sub-genre, genre family....

FIG. 5: Branching representation of the correlation matrix $M^{\gamma_1\gamma_2}$. The filtering parameter $\phi$ ranges from 0.05 to 0.25 (from top to bottom), and is increased at each step by 0.01 (the tree length is 20 steps). It induces a snake of squares at each filtering level. The shape of the snake as well as its direction are irrelevant. The tree obviously shows the emergence of homogeneous branches, which are composed of alike music sub-divisions, thereby showing evidence of genre families. The first island extraction occurs at $\phi = 0.1$, and corresponds to a family of genres related to Japanese music: [japanese, jpop, j-rock]. Among the different structures uncovered by the method, let us note the appearance of the islands $I_1$ ($\phi = 0.15$), $I_2$ ($\phi = 0.15$) and $I_3$ ($\phi = 0.18$) described in the main text.

For instance (see Fig. 5), one observes at $\phi = 0.15$ the extraction of two large sub-islands, $I_1$ and $I_2$. $I_1$ is composed of genres related to Post-Rock, Brit-Rock and Trip-Hop: [chillout, ambient, trip hop, downtempo, trip-hop, idm, post-rock, post rock, shoegaze, alt-country, post-punk, indie pop, indie rock, lo-fi, emusic, indie, folk, brit rock, british, britpop, uk]. $I_2$ is itself composed of Hip-Hop and R&B genres: [hip hop, hiphop, hip-hop, gangsta rap, rap, us hiphop, r and b, rnb]. At $\phi = 0.16$, a small sub-island, composed of all British-related tags, separates from $I_1$, thereby defining a new sub-genre. Then, at $\phi = 0.17$, $I_1$ breaks into two separate blocks, one related to Rock music, the other related to Trip-Hop music. Such a breaking also occurs for $I_2$ at $\phi = 0.16$, and leads to a Hip-Hop sub-genre and an R&B sub-genre. Finally, let us note the punk-related island $I_3$ emerging at $\phi = 0.18$.

Before concluding, one should insist on the homogeneity of the above sub-islands, i.e. their composition is rational given our a priori knowledge of music. This feature highlights the homophily [8] of the music groups, which means that similar groups, i.e. groups with similar tags, tend to be listened to by the same audience.

IV. CONCLUSION 

In this article, we study empirically the musical behaviours of a large sample of persons. Our analysis is based on web-downloaded data and uses complex network techniques in order to uncover collective trends from the data. To do so, we use percolation idea-based techniques [4] that consist in filtering correlation matrices, i.e. correlations between the music groups, and in visualising the resulting structures by a branching representation. Each of the music groups is characterised by a list of genres, which are tags used by the listeners in order to describe the music group. By studying correlations between these tags, we highlight non-trivial relations between the music genres. As a result, we draw a cartography of music, where large structures are statistically uncovered and identified as genre families. Let us stress that this work is closely related to the theory of hidden variables [34, 35], the hidden variables being here the music group tags. Consequently, this study should provide an empirical test for the theory.



This work also has many applications in marketing and show business, e.g. taste suggestions in online services, in advertising, libraries.... This kind of approach also opens the way to quantitative modelling of opinion/taste formation [36], and offers quantitative tools for sociologists and musicologists. For instance, G. d'Arcangelo [37] has recently used our analysis in order to discuss the emergence of a growing eclecticism of music listeners that is driven by curiosity and self-identification, in opposition to the uniform trends promoted by commercial radios and Major record labels [38]. Applications should also be considered in taxonomy [39], in scientometrics, i.e. how to classify scientific papers depending on their authors, journal, year, keywords..., and in linguistics [?], in order to highlight relations between a signifier (tag) and a signified (music group).